Message Passing Vs. Shared Address Space on a Cluster of SMPs 


Hongzhang Shan, Jaswinder Pal Singh


Department of Computer Science 
35 Olden Street, Princeton University, Princeton, NJ 08544
{shz, jps}@cs.princeton.edu

Leonid Oliker 

National Energy Research Scientific Computing Center 
Mail Stop 50F, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
loliker@lbl.gov

Rupak Biswas 

Mail Stop T27A-1, NASA Ames Research Center, Moffett Field, CA 94035 
rbiswas@nas.nasa.gov

September 24, 2000


Abstract 

The convergence of scalable computer architectures using clusters of PCs (or PC-SMPs) with commodity
networking has become an attractive platform for high-end scientific computing. Currently, message passing and
shared address space (SAS) are the two leading programming paradigms for these systems. Message passing has been
standardized with MPI, and is the most common and mature programming approach. However, message-passing code
development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial
ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol
overhead. In this paper, we compare the performance of, and the programming effort required for, six applications
under both programming models on a 32-CPU PC-SMP cluster. Our application suite consists of codes that typically
do not exhibit high efficiency under shared memory programming, due to their high communication to computation
ratios and complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency
of MPI for most of our applications; however, on certain classes of problems SAS performance is competitive with
MPI. We also present new algorithms for improving the PC cluster performance of MPI collective operations.


1 Introduction 

The convergence of scalable computer architectures using clusters of PCs with commodity networking has become
an attractive platform for high-end scientific computing. Currently, message passing and shared address space (SAS)
are the two leading programming paradigms for these systems. Message passing has been standardized with MPI [6],
and is the most common and mature programming approach. It provides both functional and performance
portability. However, message-passing code development can be extremely difficult, especially for irregularly
structured computations. A coherent shared address space has been shown to be very effective at moderate scale for a
wide range of applications when supported efficiently in hardware. The automatic management of naming and coherent
replication in this programming model also substantially eases the programming task compared to explicit message
passing, especially for complex irregular applications, which are naturally becoming increasingly popular as
multiprocessing matures. This ease of programming can often be translated directly into performance gains [19, 20].
Even as hardware-coherent machines replace traditional message passing systems at the high end, clusters of
commodity PCs and PC-SMPs have become increasingly popular for scalable computing. On these, the message passing
programming model is dominant and the shared address space model is unproven, since it is implemented in software.
Thus, especially given the ease of programming, it is important to understand the performance tradeoffs between
message passing and the shared address space programming model on clusters.

Approaches to supporting a shared address space in software across clusters differ not only in the specialization
and efficiency of their networks but also in the granularity at which they provide coherence. Fine-grained software
coherence uses either code instrumentation [15, 5] for access control or commodity-oriented hardware support [22]
with the protocol implemented in software. Page-grained software coherence takes advantage of the virtual memory
management facilities to provide replication and coherence at page granularity [14]. To alleviate false sharing and
fragmentation problems, a relaxed consistency model is used to buffer coherence actions. Multiple-writer protocols
allow more than one processor to modify copies of a page locally and incoherently between synchronization
points, reducing the impact of write-write false sharing and making the page consistent only when needed. Lu et al. [7]
compare the performance of PVM and the TreadMarks page-based software shared memory library on an
8-processor network of ATM-connected workstations and on an 8-processor IBM SP-2. They find that TreadMarks
generally performs a little worse. Karlsson and Brorsson [13] compared the characteristics of communication patterns
in message passing and page-based software shared memory programs, using MPI and TreadMarks running on an
IBM SP2. They found that the fraction of small messages in the TreadMarks executions leads to poor performance.
However, the platforms used in these studies are much lower-performance, smaller in scale, and not SMP-based, and
the protocols are not highly efficient. Recently, both the communication networks and the protocols for shared
virtual memory have made great progress. Some gigabit-per-second networks have been put into use. A new SVM¹
protocol called GeNIMA has also been developed for the page-grained shared address space on clusters. GeNIMA uses
general-purpose network interface support to significantly reduce protocol overheads. It has been shown to perform
quite well for moderate-scale systems on a fairly wide range of applications, achieving at least half of the parallel
efficiency of a high-end hardware-coherent system and often exhibiting better behavior [10, 1]. Thus, a study
comparing the performance of GeNIMA against message passing implementations of the same applications, message
passing being the dominant way of programming applications for clusters today, becomes necessary and important.

In this paper, we compare the performance of these two programming models using the best implementations
available to us (MPI/Pro from MPI Software Technology Inc. and the GeNIMA SVM protocol for SAS) on a cluster of
eight 4-way SMPs (for a total of 32 processors). The applications used are those that have been selected to compare
performance and programming ease on hardware-coherent platforms, including regular as well as dynamic irregular
applications. Our application suite includes codes that scale well on tightly-coupled machines as well as codes for
which scalable performance is challenging to obtain due to their high communication to computation ratios and
complex communication patterns. With a couple of exceptions, they are generally applications where developing
efficient message passing implementations is not extremely difficult (even though it still takes a lot more work than
the shared address space implementations). Porting these applications to the cluster did not require code
modifications; however, some optimizations were performed to improve performance on the cluster platform [11, 18].
We find that while some classes of applications can achieve similar performance for MPI and SVM on the cluster, in
most cases MPI performs significantly better than the shared address space model. The performance of SVM suffers
greatly from the protocol overhead, especially at the synchronization points, which often become the performance
bottleneck. Further research into reducing the protocol overhead for SVM is required to achieve high performance.
Some of the applications we have selected, such as 1-D FFT and radix sorting, are difficult to implement efficiently
in either programming model on a cluster due to the limited bandwidth of the memory bus on the SMP nodes. This
increases the challenge for scalable performance since the memory bus is also involved in communication. Some other
irregular and unpredictable applications are challenging to implement efficiently in message passing but have low
communication requirements, so for these the performance of the shared address space programming model is expected
to be much better. For example, while we do not have a message passing implementation of this application, the
speedup for the volrend volume rendering application is approximately 27 on 32 processors using the same platform
with the GeNIMA protocol [10].

Currently, if very high performance is the goal, then the difficulty of MPI programming appears to be justified for
commodity clusters of SMPs. On the other hand, if ease of programming is important, then SVM provides it
at roughly a factor-of-two cost in performance for many applications (and less for others). This may be considered
encouraging for SVM, given the ease of programming advantages for complex applications as well as the difficult
nature of our application suite and the relative maturity of the MPI library. Application-driven research into coherence
protocols and extended hardware support should reduce SVM and SAS overheads on future systems.

We also present new algorithms for implementing MPI collective functions on our PC cluster platform. Results
show that these techniques achieve a significant improvement compared to the default MPI/Pro implementation.

The rest of the paper is organized as follows. Section 2 describes the platform we used and the implementation of
the different programming models on it. Applications are discussed in Section 3. In Section 4, we analyze the
performance differences between the two programming models for each application. Section 5 explores new algorithms
used to efficiently implement collective functions for MPI. Finally, we present our conclusions in Section 6.

¹ The words SAS and SVM are used synonymously throughout this paper.



2 Platform and Programming Models 

The platform we used for our study is a cluster of 4-way Pentium Pro SMPs. Each node has 4 CPUs running at
200MHz. Each processor has separate 8KB data and instruction L1 caches and a unified 4-way set-associative 512KB
L2 cache. Each node has 512MB of main memory and runs Windows NT 4.0. The nodes are connected together
either by Myrinet [2] or Giganet [8]. The SAS and MPI programming models are built on top of these two networks,
respectively.

2.1 SAS Programming Model 

Much research has been done on the design and implementation of shared address spaces for clustered architectures,
both at page granularity and at finer fixed granularities through code instrumentation. Among the most popular ways
to support a coherent shared address space in software on clusters is page-based shared virtual memory (SVM). SVM
takes advantage of the virtual memory management facilities to provide replication and coherence at page
granularity. To alleviate false sharing and fragmentation problems, SVM uses a relaxed memory consistency model to
buffer coherence actions such as invalidations or updates, and postpones them until a synchronization point.
Multiple-writer protocols are used to allow more than one processor to modify copies of a page locally and
incoherently between synchronization points, reducing the impact of write-write false sharing and making the page
consistent only when needed by applying diffs and write notices. Many different protocols have been developed, which
use different timing strategies to propagate write notices and apply the invalidations to pages. Recently, a new
protocol for SVM called GeNIMA has been developed and has shown good performance on moderate-scale systems for a
fairly wide range of applications, achieving at least half of the parallel efficiency of a high-end hardware-coherent
system and often exhibiting much better behavior [10, 1]. It uses general-purpose network interface support to
significantly reduce protocol overheads. Thus, in this study we select GeNIMA as our protocol for the SAS programming
model. GeNIMA is built on top of VMMC, a high-performance, user-level virtual memory mapped communication
library [4]. VMMC itself runs on top of the Myrinet network.
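
To make the multiple-writer mechanism concrete, the following is a minimal sketch of how a diff can be computed and
merged. It is a generic illustration, not GeNIMA's implementation: the "twin" copy (the usual name in the SVM
literature for a pre-write snapshot of a page), the tiny page size, and all function names are assumptions
introduced here.

```python
PAGE_SIZE = 8  # tiny illustrative "page"; real pages are typically several KB


def make_twin(page):
    """Snapshot the page before the first local write (hypothetical helper)."""
    return list(page)


def compute_diff(twin, page):
    """Return (offset, new_value) pairs for the words changed since the twin."""
    return [(i, new) for i, (old, new) in enumerate(zip(twin, page)) if old != new]


def apply_diff(page, diff):
    """Merge one writer's diff into another copy of the page (e.g. at its home)."""
    for offset, value in diff:
        page[offset] = value


# Two writers modify disjoint words of the same page (write-write false sharing).
home = [0] * PAGE_SIZE
copy_a, copy_b = list(home), list(home)
twin_a, twin_b = make_twin(copy_a), make_twin(copy_b)
copy_a[1] = 11
copy_b[6] = 66

diff_a = compute_diff(twin_a, copy_a)
diff_b = compute_diff(twin_b, copy_b)

# At a synchronization point, both diffs are merged into the home copy,
# so neither writer's update is lost even though both wrote the same page.
apply_diff(home, diff_a)
apply_diff(home, diff_b)
```

Only the modified words travel in a diff, which is why the two concurrent writers above do not overwrite each
other's updates.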

Each SMP node in our cluster is connected to a Myrinet system area network via a PCI bus. The Myrinet network
interfaces are connected together through a single 16-way Myrinet crossbar switch, thus minimizing contention in the
interconnect. Each network interface has a 33MHz programmable processor and connects nodes to the network with
two unidirectional links of 160 MB/s peak bandwidth each. The actual node-to-network bandwidth is constrained by
the 133MB/s PCI bus.

The parallelism constructs and calls needed by the SAS programs are exactly the same as those used in our
hardware-coherent platform implementation (SGI Origin 2000) [17, 18, 19]. This makes portability trivial between
these platforms.

2.2 Message Passing Programming Model 

The message passing implementation used in this study is MPI/Pro from MPI Software Technology Inc., developed
directly on top of Giganet networks using the VIA [9] interface. By selecting MPI/Pro, instead of building our own MPI
library from VMMC, we can compare the best known versions of both programming models. Thus our final conclusions are
not affected by a poor implementation of the communication layer. Fortunately, the VIA interface and VMMC
have similar latency (Figure 1) and bandwidth (Figure 2) characteristics on our cluster platform. Giganet performs
slightly better for short messages, while Myrinet has a small advantage for larger messages. Thus, for message
passing programs, there should be little performance difference for similar implementations across these two networks.
As with Myrinet, the Giganet network interfaces are connected together by a single Giganet crossbar switch.



Figure 1: The latency of different message sizes for the VMMC and VIA communication interfaces

Figure 2: The bandwidth of different message sizes for the VMMC and VIA communication interfaces


3 Applications 

Our application suite consists of codes used in previous studies to examine the performance and implementation
complexity of various programming models on hardware-supported cache-coherent platforms. These codes include
regular applications (FFT, OCEAN, and LU) and irregularly structured applications (RADIX sorting, SAMPLE sorting,
and N-BODY). All six codes have either a high communication to computation ratio or complex communication patterns,
making scalable performance a difficult task on cluster platforms. FFT uses a non-localized but regular all-to-all
personalized communication pattern to perform a matrix transposition; i.e., every process communicates with every
other, sending different data across the network. OCEAN exhibits primarily nearest-neighbor patterns, but in a
multi-grid rather than a single-grid formation. RADIX sorting uses all-to-all personalized communication but in an
irregular and scattered fashion. SAMPLE sorting also uses all-to-all personalized communication; however, the
communication patterns are more regular than in RADIX sorting. LU uses one-to-many non-personalized communication.
Finally, N-BODY requires all-to-all, all-gather communication and unpredictable send/recv communication patterns.
These applications have shown high performance under both MPI and SAS, for reasonably large data sets on
hardware-supported coherent platforms.

Most of the MPI programs were ported directly onto the cluster platform without any changes. However,
OCEAN and RADIX sorting required some changes for high performance. In OCEAN, the matrix is partitioned by
rows instead of by blocks (see Figure 3). This allows each processor to communicate with only its upper and lower
neighbors, thus reducing the number of messages sent across the network while improving the spatial locality of the
communicated data. For RADIX, in the key exchange stage, each processor sends only one message to every other
processor, containing all its chunks of keys that are destined for the destination processor. The destination processor
then reorganizes the data chunks into their correct positions.
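
The effect of the row-wise OCEAN partition on message counts can be illustrated with a small sketch. It counts
one message per (sender, neighbor) pair for a single boundary exchange; the 4 x 8 processor grid used for the block
layout and all function names are assumptions made for this illustration, not details taken from the OCEAN code.

```python
def row_neighbors(rank, nprocs):
    """Neighbors of `rank` when the grid is split into horizontal strips."""
    return [r for r in (rank - 1, rank + 1) if 0 <= r < nprocs]


def block_neighbors(rank, prow, pcol):
    """Neighbors of `rank` on a prow x pcol processor grid (row-major ranks)."""
    i, j = divmod(rank, pcol)
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [a * pcol + b for a, b in candidates if 0 <= a < prow and 0 <= b < pcol]


def total_messages(neighbor_lists):
    """One message per (sender, neighbor) pair in a single boundary exchange."""
    return sum(len(neighbors) for neighbors in neighbor_lists)


# 32 processors: horizontal strips versus an assumed 4 x 8 block layout.
rows_32 = [row_neighbors(r, 32) for r in range(32)]
blocks_32 = [block_neighbors(r, 4, 8) for r in range(32)]

print("row partition  :", total_messages(rows_32), "messages per exchange")
print("block partition:", total_messages(blocks_32), "messages per exchange")
```

With strips, no processor has more than two neighbors and every boundary is a contiguous row. In the block layout
an interior processor has four neighbors, and its left and right boundaries are columns, which are strided in
row-major storage.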



Figure 3: The partition method for OCEAN for the four-processor case: block way vs. row way

On the hardware-supported coherent platform, each processor sends each contiguously destined chunk of keys directly
as a separate message. This is done so that the data can immediately be placed into the correct position at the
destination processor. As a result, multiple messages are sent from one processor to every other processor. While this
method succeeds on hardware-supported coherent systems, the modified approach is better suited to cluster platforms.
On cluster systems, there is a performance gain when reducing the number of messages sent, even though local
computation is increased. In order to study the effect of the two-level architecture (intra-node and inter-node), we
test our applications by reorganizing the communication sequence (intra-node first, inter-node first, and intra-node
and inter-node mixed). Interestingly, our results show that the MPI programs are insensitive to this method of
exploiting a two-level communication hierarchy: the various communication sequences have almost no effect on
application performance.

For the SAS codes, FFT, LU, and SAMPLE sorting were ported without any modifications. In RADIX sorting,
we used the improved version [18]. Here, instead of exchanging the keys in a scattered way, keys that are destined
for the same processor are buffered together and then written to their destination. Several modifications were applied
to the original version of OCEAN to improve performance [10] on the cluster. The matrix was partitioned by rows
across processors instead of by blocks, and significant changes were made to the data structures. The N-BODY code
also required substantial modifications since the original version suffered from the high overhead of the
synchronizations used during the shared-tree building phase. A new tree building method called Barnes-spatial has been
developed to completely eliminate these expensive synchronization operations [16].

These applications have previously been used to evaluate the performance of different programming models on a
hardware-supported cache-coherent platform. In that study, it was shown that SAS programs provide substantial ease
of programming compared to message-passing implementations, and that performance varies depending on the application
but is often better for SAS as well. The ease of programming holds true on cluster systems, although some SAS
code restructuring was required to improve performance. For example, in N-BODY the tree-building methodology has
been changed from the original synchronization-intensive scheme to the spatial method. Nonetheless, the implementation
is still easier than the message-passing approach, as has been argued earlier in the hardware-coherent context [12],
where it was shown that this implementation is valuable for high-end scalability even on hardware-coherent machines.

A comparison between SAS and MPI programmability is presented in Table 1. Notice that SAS programs require
fewer essential lines of code (excluding initialization code, debugging code, and comments) compared with
message passing. As application complexity (e.g., irregularity and dynamic nature) increases, we see a more significant
reduction of programming effort using SAS.


        FFT     LU      OCEAN   RADIX   SAMPLE  N-BODY
SAS     210     309     2878    201     450     950
MPI             470     4320    384     479     1371

Table 1: The number of essential code lines needed by SAS and MPI for different applications.


4 Performance Analysis 

In this section, we compare the performance of our applications using both programming paradigms. We first examine
the speedup numbers, and then analyze the performance in more detail using time breakdowns. The speedups for
the different programming models are based on the best sequential time, without any parallel programming model
overhead. The total running time is divided into three components: LOCAL, RMEM, and SYNC. The LOCAL time includes
the CPU computation time and the CPU waiting time for local cache misses. The RMEM time is the CPU time spent on
remote communication, and SYNC is the time spent on synchronization. We select two data sets for each application.
First we examine a "basic" data set, at which shared virtual memory begins to perform reasonably "well" [10].
Next, we use a larger data set, since in general, increasing the problem size tends to improve many inherent program
characteristics, such as load balance, communication to computation ratios, and spatial locality.

4.1 FFT 

FFT has a very high communication to computation ratio, which diminishes only logarithmically with problem size. It
needs a non-localized but regular all-to-all personalized communication pattern to perform the matrix transposition,
and there is no overlap between the transposition and computation. It is much more difficult to achieve performance
on the 1-dimensional FFT used here than on higher-dimensional FFTs. The speedups for MPI and SAS are presented
in Figure 4.
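
The all-to-all personalized pattern behind the transposition can be sketched as follows. This is a sequential
simulation under stated assumptions: the matrix is partitioned by row blocks, the number of processes divides the
matrix dimension, and the "mailbox" stands in for the network. The function name and data layout are illustrative
and are not taken from the FFT code used in this study.

```python
def transpose_all_to_all(local_rows, nprocs):
    """Transpose an n x n matrix whose row blocks are held in local_rows[p]."""
    n = sum(len(rows) for rows in local_rows)
    b = n // nprocs  # rows (and columns) per process; assumes nprocs divides n

    # Send phase: process p packs a distinct b x b sub-block for every process q.
    mailbox = [[None] * nprocs for _ in range(nprocs)]  # mailbox[q][p]: p -> q
    for p in range(nprocs):
        for q in range(nprocs):
            mailbox[q][p] = [row[q * b:(q + 1) * b] for row in local_rows[p]]

    # Receive phase: process q transposes each received sub-block into place.
    result = []
    for q in range(nprocs):
        rows = [[None] * n for _ in range(b)]
        for p in range(nprocs):
            block = mailbox[q][p]
            for i in range(b):
                for j in range(b):
                    rows[j][p * b + i] = block[i][j]
        result.append(rows)
    return result


matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
parts = [matrix[0:2], matrix[2:4]]          # two processes, two rows each
transposed = transpose_all_to_all(parts, 2)
flat = [row for part in transposed for row in part]
```

Every process sends a different sub-block to every other process, so the number of messages grows with the square
of the process count while the total data volume stays fixed.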

Neither MPI nor SAS achieved good scalability. Increasing the data set size helped, but not significantly.
This is mainly due to the pure communication transpose stage, whose communication to computation ratio does not
change with problem size. It occupies only 16% of the total execution time in the sequential run. However, the
percentage increases to 50% for the 32-processor run. Dealing with the scaling of the pure all-to-all communication is
very difficult. As the number of nodes that fetch data from each node increases, the contention in the network
interface of the servicing node increases. At the same time, since these remote requests need to access the memory bus,
the additional contention for the memory bus affects local memory access time as well. Also, the program suffers from
the low bandwidth of the memory bus on our commodity 4-way SMP nodes, which is a significant problem with commodity
SMPs. High contention is caused when 4 processors work simultaneously within a node. For example, the LOCAL
time (which includes local memory stall time) for the 4M data set when using 2 processors is about 6s. This drops to
only 4.8s (compared to the ideal of 3s) when 4 processors are used.

Figure 4: Speedups of FFT for SAS and MPI on 16 and 32 processors for 1M and 4M data set sizes
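
The LOCAL-time figures quoted for the 4M data set can be turned into a scaling efficiency with a few lines of
arithmetic. The helper names below are illustrative; only the 6s, 4.8s, and 3s values come from the text.

```python
def ideal_time(t_ref, p_ref, p):
    """Time expected on p processors if the work scaled perfectly from p_ref."""
    return t_ref * p_ref / p


def scaling_efficiency(t_ref, p_ref, t, p):
    """Ratio of the ideal time to the measured time on p processors."""
    return ideal_time(t_ref, p_ref, p) / t


local_2procs = 6.0   # seconds of LOCAL time on 2 processors (from the text)
local_4procs = 4.8   # seconds of LOCAL time on 4 processors (from the text)

ideal_4procs = ideal_time(local_2procs, 2, 4)
efficiency = scaling_efficiency(local_2procs, 2, local_4procs, 4)
print(f"ideal: {ideal_4procs:.1f}s, efficiency: {efficiency:.3f}")
```

Going from 2 to 4 processors within a node therefore recovers only about 62.5% of the ideal reduction in LOCAL
time, which is consistent with contention on the shared memory bus.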

Even though neither programming model scales well for FFT, MPI significantly outperforms SAS. To better
understand the performance difference, Figure 5 presents the time breakdown for the 4M data set running on 32
processors.

Figure 5: The FFT time breakdown for SAS and MPI on 32 processors for the 4M data set size

Surprisingly, we find that all three components (LOCAL, RMEM, and SYNC) are much higher in SAS
than in MPI. In order to maintain page coherence, high protocol overhead is introduced in SAS programs, including
computing the diffs, creating the timestamps, creating the write notices, and garbage collection. This dramatically
increases the compute time, and also causes an increase in local cache misses for the application data, leading to
higher LOCAL time. The diffs generated for coherence are immediately propagated to the home of the pages,
thus increasing the network traffic and possibly causing more contention. At synchronization points, handling the
protocol requirements, including the expensive invalidation of the necessary pages, greatly dilates the
synchronization interval. In the MPI program, none of these protocol overheads exist. MPI does need to pack the data
at the source and unpack them at the destination to make communication more efficient. However, this overhead is much
smaller and completely local compared with the protocol overhead in the SAS program. If the SAS program were
structured such that each sub-matrix transposed to a different processor were allocated separately (similarly to the
MPI program), instead of all being allocated together in a shared data structure of a row set or a full matrix (which
is most natural due to the row-based partitioning of the computation), the performance of the transpose could be
improved. But this causes a lot of the programming ease of SAS to be given up, and in fact can make the row-wise local
FFTs complex to implement.

4.2 OCEAN 

OCEAN exhibits a commonly used nearest-neighbor pattern, but in a multi-grid rather than a single-grid formation.
The communication to computation ratio is large for smaller problem sizes but diminishes rapidly with increasing
problem size. The speedups are shown in Figure 6. The speedups are still relatively low compared with those
that we have achieved on the hardware-supported cache-coherent platform. Larger data sets improve performance,
especially for the MPI programs.

Figure 6: Speedups of OCEAN for SAS and MPI on 16 and 32 processors for 258x258 and 514x514 grid sizes

Figure 7: The OCEAN time breakdown for SAS and MPI on 32 processors for the 514x514 grid size


The SAS program suffers from the expensive overhead of synchronization. This is clearly shown in Figure 7. After
each nearest-neighbor communication, a synchronization operation is required in SAS to maintain coherence. Thus,
there are many thousands of barrier operations throughout OCEAN. Further analysis of the synchronization time
shows that about 50% of the synchronization time is spent waiting, and 50% is spent on protocol processing [1].
Thus, the synchronization cost can be improved either by reducing protocol overhead or by increasing the data set size.
There is not enough computational work between the synchronization points for the 514x514 grid size, especially
when this grid is further coarsened into smaller grids during program execution. However, OCEAN has a very
high memory requirement due to its use of more than twenty large data arrays. This prevents us from running larger
data sets. In the MPI program, the synchronization is much cheaper since it is implicitly implemented in the
send/receive pairs.

4.3 LU 

LU uses one-to-many non-personalized communication: the pivot block and the pivot row blocks are communicated
to √P processors each. The communication needs are relatively small compared with our other applications. Thus,
we expect better performance for this application, as seen in Figure 8. From the time breakdown in Figure 9, we also
see that most of the overhead is in LOCAL time. Here, the communication cost is very small. Further improvement
can be achieved by reducing the synchronization cost, though this is mainly wait time caused by load imbalance.

Notice that for LU, the performance of SAS is very close to that of MPI, since both have similar time breakdowns.
The protocol overhead of running the SAS program becomes less important for this application. This is because, unlike
in FFT, the matrix is already organized as a 4-dimensional array to ensure that the blocks assigned to a processor are
allocated locally and contiguously. Thus, each processor only needs to modify its own blocks, which are allocated
locally, and the modifications are immediately applied to the data pages. No diffs are generated and propagated to
other nodes. This greatly reduces the overhead of protocol processing. These performance results indicate that
the two programming models do not show much performance difference, due to the relatively low communication
requirements of LU.
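
The 4-dimensional organization can be pictured as indexing the matrix by (block row, block column, row within
block, column within block), so that each block occupies one contiguous range of storage. The sketch below
illustrates this index mapping only. The 2-D cyclic assignment of blocks to processors and all function names are
assumptions made for illustration and are not taken from the LU code used in this study.

```python
def to_block_index(r, c, bsize):
    """Map global element (r, c) to the 4-D index (block row, block col, i, j)."""
    return (r // bsize, c // bsize, r % bsize, c % bsize)


def block_owner(bi, bj, prow, pcol):
    """Assumed 2-D cyclic assignment of blocks to a prow x pcol processor grid."""
    return (bi % prow) * pcol + (bj % pcol)


def flat_offset(bi, bj, i, j, nblocks, bsize):
    """Offset in block-major storage: all elements of a block are contiguous."""
    return ((bi * nblocks + bj) * bsize + i) * bsize + j


# An 8 x 8 matrix stored as 2 x 2 blocks of size 4 x 4.
NBLOCKS, BSIZE = 2, 4
index = to_block_index(5, 2, BSIZE)
offsets = [flat_offset(1, 0, i, j, NBLOCKS, BSIZE)
           for i in range(BSIZE) for j in range(BSIZE)]
owner = block_owner(1, 0, 2, 2)
```

Because a block never shares its storage range with another block, a processor that updates only its own blocks
does not write to pages holding another processor's data, provided each block's storage is aligned to page
boundaries.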


Figure 8: Speedups of LU for SAS and MPI on 16 and 32 processors for 4096x4096 and 6144x6144 matrix sizes

4.4 RADIX

The above three applications (FFT, OCEAN, and LU) all show regular characteristics. The following three codes
we investigate, RADIX, SAMPLE, and N-BODY, are irregularly structured. RADIX sort uses all-to-all personalized
communication but in an irregular and scattered fashion. It also has a very high communication to computation ratio
that is independent of problem size and number of processors. This application has high bandwidth requirements
for the memory bus, which are often not satisfied on current SMP platforms. Thus, high contention is caused on the
memory bus when 4 processors are used on a node. The "aggregate LOCAL" time across processors is much higher
than in the uniprocessor case. This leads to the poor performance shown in Figure 10. MPI still performs better than
SAS. From the time breakdown in Figure 11, we find that the RMEM time and SYNC time are much higher for
SAS. This is due to reasons similar to those discussed in the FFT subsection, as the communication is all-to-all in
chunks here as well (albeit irregular).



Figure 10: Speedups of RADIX for SAS and MPI on 16 and 32 processors for 4M and 32M integers

Figure 11: The RADIX time breakdown for SAS and MPI on 32 processors for 32M integers

The implementation of the all-to-all communication in the MPI RADIX program on the cluster is different from that
used in the MPI version on a hardware-supported cache-coherent platform. In the cluster MPI program, each processor
sends only one message to every other processor, containing all the chunks that are destined for the destination
processor. The destination processor then reorganizes the data chunks into their correct positions. This is similar to
the algorithm used in the NAS parallel application IS [3]. But on the hardware-supported cache-coherent platform,
each processor sends each contiguously destined chunk directly as a separate message so that the destination can
put the data into the correct position, thus leading to multiple messages from each processor destined for every other
processor. This is a tradeoff between computation and communication, and depends on the cost of communication
messages on the machine. On high-overhead and low-bandwidth clusters, using fewer messages is more important than
the computation involved in gathering and scattering chunks.
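
The tradeoff between the two exchange strategies can be sketched as follows. Chunks are represented as
(destination rank, destination offset, keys) triples; this representation and the function names are assumptions
made for illustration, not the data structures of the RADIX program.

```python
from collections import defaultdict


def pack_by_destination(chunks):
    """Combine all chunks bound for the same rank into a single message."""
    messages = defaultdict(list)
    for dest, offset, keys in chunks:
        messages[dest].append((offset, keys))
    return dict(messages)


def unpack(message, local_array):
    """Destination side: place every chunk of a message at its final position."""
    for offset, keys in message:
        local_array[offset:offset + len(keys)] = keys


# One sender holds four chunks bound for two destination ranks.
chunks = [(0, 0, [1, 2]), (1, 0, [9]), (0, 4, [3]), (1, 2, [8, 7])]

separate_messages = len(chunks)          # one message per chunk
packed = pack_by_destination(chunks)     # one message per destination
packed_messages = len(packed)

dest0 = [None] * 5
unpack(packed[0], dest0)
```

Packing reduces the message count from one per chunk to one per destination, at the price of the extra local work
done in `pack_by_destination` and `unpack`.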

4.5 SAMPLE 

SAMPLE sorting also uses an irregular personalized all-to-all communication, but compared with RADIX sorting,
it is more regular and the communication is much better structured. SAMPLE speedups are presented in Figure 12.
Compared with RADIX, the performance is much better. Notice that we use the same sequential time to compute the
speedups for both RADIX and SAMPLE sorting. In SAMPLE sort, each processor first does a local sort on its partitioned
data using radix sort, then performs the all-to-all communication to exchange keys, followed by another local
sort on the newly received data. In the sequential case, only one local sort is enough to sort all the keys. Thus, we
can reasonably expect SAMPLE sort to achieve only 50% parallel efficiency.
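
The three phases of SAMPLE sort described above can be sketched as a sequential simulation. The splitter
selection by regular sampling is an assumption (the text does not describe how splitters are chosen), Python's
`sorted` stands in for the local radix sort, and the function name is illustrative.

```python
import bisect


def sample_sort(parts):
    """Sort keys distributed over processes; parts[p] holds process p's keys."""
    nprocs = len(parts)

    # Phase 1: local sort (the paper uses radix sort; sorted() stands in here).
    parts = [sorted(part) for part in parts]

    # Assumed splitter selection: regular samples taken from each sorted part.
    samples = sorted(k for part in parts
                     for k in part[::max(1, len(part) // nprocs)])
    step = max(1, len(samples) // nprocs)
    splitters = samples[step::step][:nprocs - 1]

    # Phase 2: all-to-all personalized exchange, one bucket per destination.
    inbox = [[] for _ in range(nprocs)]
    for part in parts:
        for key in part:
            inbox[bisect.bisect_right(splitters, key)].append(key)

    # Phase 3: second local sort on the newly received keys.
    return [sorted(bucket) for bucket in inbox]


parts = [[9, 3, 7, 1], [8, 2, 6, 4], [5, 0, 11, 10]]
buckets = sample_sort(parts)
flat = [key for bucket in buckets for key in bucket]
```

Each key is sorted twice (once before and once after the exchange), which is the source of the 50% bound on
parallel efficiency relative to a single sequential sort.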



Figure 12: Speedups of SAMPLE for SAS and MPI on 16 and 32 processors for 4M and 32M integers 

If we look at the SAMPLE time breakdown in Figure 13 and compare it with the RADIX time breakdown in Figure 11,
we see that for both MPI and SAS, the RMEM time and SYNC time are greatly reduced. On the SGI Origin 2000, we found
that in most cases RADIX performs better than SAMPLE sort. However, on the cluster platform the opposite is true:
SAMPLE sorting outperforms RADIX sorting. This result verifies that reducing the number of messages is much more
important on the cluster than avoiding additional computation (despite the large increase in local computation here).

Note that the LOCAL time in SAMPLE sort is only slightly higher than in RADIX, even though much more
computation is performed in SAMPLE. This means that contention on the memory bus is much
higher for RADIX sorting than for SAMPLE sort, whose memory access patterns are more sequential.

Figure 13: The SAMPLE time breakdown for SAS and MPI on 32 processors for 32M integers


4.6 N-BODY 

Finally, we examine the performance of the N-BODY application. From Figure 14, we find that MPI still performs
better than SAS, especially as the data set size increases. For 128K bodies on 32 processors, MPI achieves a speedup
of 27, compared to a speedup of 15 using SAS. The time breakdown for the 128K data set on 32 processors is shown in
Figure 15. SAS has higher SYNC time and RMEM time, but the SYNC time is much higher and dominates. This is
because at each synchronization point, many diffs and write notices need to be processed, and many shared pages
have to be invalidated, so the synchronization wait time is further dilated by serialization. Further analysis shows
that 82% of the barrier time is spent on protocol handling. This expensive synchronization problem occurs in all of our
applications except LU, and the performance of SVM programs suffers heavily from it. Further research on the SVM
coherence protocol should focus on reducing the synchronization cost. Possible approaches include applying the diffs
before the synchronization points, moving the invalidation of shared pages out of synchronization points, and increasing
hardware support.
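
For reference, parallel efficiency is speedup divided by the number of processors. The short Python sketch below applies that standard definition to the two N-BODY speedups quoted above; the function name parallel_efficiency is chosen only for illustration.

```python
def parallel_efficiency(speedup, nprocs):
    """Parallel efficiency: achieved speedup as a fraction of the ideal speedup."""
    return speedup / nprocs


mpi_eff = parallel_efficiency(27, 32)   # 0.84375
sas_eff = parallel_efficiency(15, 32)   # 0.46875
print(f"MPI: {mpi_eff:.0%}, SAS: {sas_eff:.0%}")
```

On these figures, SAS reaches a little over half of the MPI efficiency for 128K bodies on 32 processors.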



Figure 14: Speedups of N-BODY for SAS and MPI on 16 and 32 processors for 32K and 128K bodies

Unlike our other five applications, the MPI version of N-BODY has a higher LOCAL time than the SAS code. This
is due to the use of different high-level algorithms for each programming model. In the SAS program, a shared tree
is built and each processor builds one part of it, while in the MPI program, a locally essential tree is used. Building the
locally essential tree is quite expensive and involves much more computation. A processor first has to build a local
tree using the bodies partitioned to it; then it computes the nodes needed by every other processor from its local tree
and sends those nodes to their destinations. After receiving all the nodes necessary for the ensuing force calculation,
it has to add them to its local tree to generate the locally essential tree. Thus, considerable computation overhead is
introduced into the MPI version by building locally essential trees. As data sets grow larger, these
differences in building the tree become less important and the force calculation phase begins to dominate.


5 Implementation of Collective Functions for MPI 

Figure 15: The N-BODY time breakdown for SAS and MPI on 32 processors for 128K bodies

An interesting question for clusters, and particularly for hybrid clusters of SMPs, is how to structure collective
communication. In the MPI library, the communication functions can be divided into three categories: the basic send/receive
functions, collective functions, and other operations. The performance of the basic send/receive functions depends mainly
on the underlying communication hardware and low-level software. However, the performance of the collective
functions is affected by the algorithms used to implement them. Research in this area has been performed for several other
platforms [21]. In this section, we discuss the algorithms suitable for our platform: a cluster of 4-way SMPs. Specifically,
we explore the algorithms for two collective functions, MPI_Allreduce and MPI_Allgather, that are used in
our applications. Here, we call the MPI/Pro implementation the "original" (the exact algorithms used are not well
documented) and use it as our baseline.

5.1 MPI_Allreduce 

The most commonly used algorithm is the binary tree (B-Tree), which is shown in Figure 16. The structure of our
4-way SMP nodes leads us to change the lowest level of the tree to a four-way structure (called B-Tree-4). Within
a node, the communication can be implemented either by shared memory or by the basic MPI send/receive functions. We
observe no difference in performance between these two in-node approaches for the collective communication. The
result for reducing a double variable is shown in Table 2. The binary tree algorithm performs similarly to the MPI/Pro
implementation. The B-Tree-4 algorithm performs somewhat better.

Figure 16: The algorithms used to implement MPI_Allreduce, using two nodes as an example


Algorithm    Original    B-Tree    B-Tree-4
Time (us)    1117        1035      9?? (illegible)

Table 2: The time needed for MPI_Allreduce on 32 processors (8 nodes) for different algorithms
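
The structural difference between the two trees can be sketched in Python. This is an illustrative model only: the exact tree shapes of Figure 16 are assumed (a stride-doubling binary tree rooted at rank 0, and a variant whose lowest level sends directly to a per-node leader), the function names are hypothetical, and the sketch counts tree levels rather than modeling time.

```python
def binary_tree_levels(nprocs):
    """Per-level (src, dst) pairs for a binary-tree reduction to rank 0."""
    levels, stride = [], 1
    while stride < nprocs:
        levels.append([(r, r - stride) for r in range(stride, nprocs, 2 * stride)])
        stride *= 2
    return levels


def btree4_levels(nprocs, node_size=4):
    """B-Tree-4 (assumed shape): a four-way lowest level, then a binary tree over node leaders."""
    lowest = [(r, r - r % node_size) for r in range(nprocs) if r % node_size]
    upper = [[(s * node_size, d * node_size) for s, d in level]
             for level in binary_tree_levels(nprocs // node_size)]
    return [lowest] + upper


def allreduce_sum(values, levels):
    """Reduce up the tree to rank 0, then broadcast the result back down."""
    acc = list(values)
    for level in levels:                 # reduce phase, leaves to root
        for src, dst in level:
            acc[dst] += acc[src]
    for level in reversed(levels):       # broadcast phase, root to leaves
        for src, dst in level:
            acc[src] = acc[dst]
    return acc


values = [float(rank) for rank in range(32)]
print(len(binary_tree_levels(32)), "levels for B-Tree")
print(len(btree4_levels(32)), "levels for B-Tree-4")
print(allreduce_sum(values, btree4_levels(32))[0])
```

Under these assumptions, the four-way lowest level removes one tree level on 32 processors while every rank still ends up with the same reduced value.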


5.2 MPI_Allgather 

We explored several algorithms for this case. The first method is the binary tree method, and the second is the B-Tree-4
approach. To understand their behavior, we examine level 0 and level 1 (see Figure 16). In the B-Tree-4 algorithm,
after the level-0 processor collects all the data, it broadcasts all the collected data to the level-1 processors, then from
level 1 to level 2, and so on. At this point, however, each processor in level 1 already owns the data collected from the
subtree rooted at it, so it is not necessary to broadcast all the data back to them. Instead, the processors in level 1 can
immediately exchange data among themselves. Since there are two processors in level 1, these two simply exchange data
with each other. We call this algorithm B-Tree-4-New. We can further extend this idea to level 2. Starting from the
B-Tree-4-New algorithm, the four processors in level 2 exchange data among themselves; each processor needs two
send/receive function calls. This method is called B-Tree-4-New-2. In Table 3, we present the time needed for the
different algorithms to perform the allgather function for 1K-integer and 1-integer messages. We found that with the new
algorithms, the performance on our platform is greatly improved, indicating that the original MPI/Pro algorithm does not
use the optimization of sending down only the necessary data and performing a data exchange. However, since most of
the remote communication time in the applications was spent in the send/receive functions, the overall application
performance was only slightly improved.


Algorithm    Original    B-Tree    B-Tree-4    B-Tree-4-New    B-Tree-4-New-2
Time (us)    1479        1395      1124        (illegible)     (illegible)

Further legible entries, whose row and column positions are uncertain: 993, 994, 975


Table 3: The time needed for MPI_Allgather on 32 processors (8 nodes) for different algorithms using 1K and 1 integers
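
The idea of replacing the downward broadcast with a peer exchange can be sketched in Python. This is a hedged illustration, not the implementation measured here: it assumes a power-of-two number of peers at the level in question, uses a recursive-doubling exchange (one plausible scheme consistent with two send/receive calls among four processors, but not confirmed by the text), and counts transferred items rather than time. The function names are hypothetical.

```python
def gather_then_broadcast(blocks):
    """Baseline: peers send their block to peer 0, which sends the full buffer back."""
    full = [x for block in blocks for x in block]
    sent = sum(len(block) for block in blocks[1:])   # upward: one block per peer
    sent += len(full) * (len(blocks) - 1)            # downward: whole buffer per peer
    return [list(full) for _ in blocks], sent


def pairwise_exchange(blocks):
    """Peers swap only what the partner lacks (recursive doubling, 2^k peers assumed)."""
    n = len(blocks)
    held = [{i: list(block)} for i, block in enumerate(blocks)]
    sent, rounds, stride = 0, 0, 1
    while stride < n:
        merged = [dict(h) for h in held]
        for rank in range(n):
            partner = rank ^ stride
            merged[rank].update(held[partner])
            sent += sum(len(v) for v in held[partner].values())
        held, stride, rounds = merged, stride * 2, rounds + 1
    result = [[x for i in sorted(h) for x in h[i]] for h in held]
    return result, sent, rounds


# Two level-1 peers, each already holding the data gathered from its subtree.
halves = [list(range(0, 16)), list(range(16, 32))]
print(gather_then_broadcast(halves)[1], "items moved with gather + broadcast")
print(pairwise_exchange(halves)[1], "items moved with a direct exchange")
```

In this model the direct exchange moves fewer items and avoids the extra pass through the root, which is consistent with the improvement reported above.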


6 Conclusion 

In this paper, we studied the performance of and programming effort for six applications using message-passing
and SAS programming on a 32 CPU PC cluster. The system consisted of eight 4-way Pentium Pro SMPs running
Windows NT 4.0. To create a fair comparison between the two programming methodologies, we used the best
known implementations of the communication libraries. The message-passing implementation, MPI/Pro, is built
directly on top of the Giganet network through the VIA interface, which the Giganet network interface implements in
hardware. The SAS implementation is a shared virtual memory (SVM) implementation and uses the GeNIMA protocol
over the VMMC low-level communication library, which is implemented in firmware and software on the Myrinet
network. Experiments showed that VIA and VMMC have similar latency and bandwidth characteristics on the cluster
platform. The GeNIMA protocol has been developed for page-grained shared address space on clusters, and uses
general-purpose network interface support to significantly reduce protocol overhead. Our application suite consists of
several codes that are challenging for scalable performance due to their high communication-to-computation ratios and
complex communication patterns.

Three regular applications (FFT, OCEAN, and LU) and three irregularly structured codes (RADIX, SAMPLE,
and N-BODY) were presented. Porting these applications did not require code modifications; however, some
optimizations were performed to improve performance on the cluster platform. Changes included reducing the number
of messages in the message-passing versions and removing fine-grained synchronization from the SAS codes. FFT,
OCEAN, RADIX, and SAMPLE did not scale well under either programming model due to their high communication-to-computation
ratios and/or the limited bandwidth of the memory bus on the 4-way SMP nodes (and the resulting
contention between communication and local computation). LU and N-BODY showed better performance characteristics
due to lower communication-to-computation ratios.

Overall, SVM provides substantial ease of programming, especially for the more complex applications that
are irregular or dynamic in nature. However, unlike in a previous study for hardware-coherent machines, where the
shared address space implementations were also performance-competitive with MPI, and despite all the research in SVM
protocols and communication libraries in the last several years, SVM achieved only about half the parallel efficiency
of MPI for most of our applications. LU was an exception, in which the SVM implementation achieved very similar
performance to the MPI version. The higher runtimes of the SVM versions were due to the high cost of the protocol
overhead associated with maintaining page coherence and implementing synchronization. These costs include
computing diffs, creating timestamps, generating write notices, and performing garbage collection. Thus, if very high
performance is the goal, then the difficulty of MPI programming appears to be justified for commodity clusters of
SMPs today. On the other hand, if ease of programming is important, then SVM provides it at roughly a factor-of-two
cost in performance for many applications (and less for others). This may be considered encouraging for SVM, given
its ease-of-programming advantages for complex applications as well as the difficult nature of our application suite and
the relative maturity of the MPI library. Application-driven research into coherence protocols and extended hardware
support should reduce SVM and SAS overheads on future systems.

Finally, we presented new algorithms for implementing MPI collective functions on our PC cluster platform.
Results show that some of these techniques achieve a significant improvement compared to the default MPI/Pro
implementation.